
Blockwise float8 quantizer and quantized tensor class #1513

Open · wants to merge 2 commits into main
Conversation

@kwyss-nvidia commented Feb 27, 2025

Description

Adds PyTorch and C++ quantizer and quantized tensor classes for a subchannel quantization scheme.

The classes are configurable for a 128x128 block size (block_scaling_dim == 2) or a 1x128 block size (block_scaling_dim == 1).

Scale tensors are stored in a format amenable to matrix multiplication; however, matmul integration is deferred to a separate story.

Fusions of quantization with DBIAS or activation functions are not yet implemented, and dequantization is currently implemented in torch.

Tests for quantization are included at the C++ and PyTorch layers, with exact comparison against reference quantizer behavior, plus coverage of interesting API branches such as tensor creation in PyTorch and C++ and dequantization of row-wise and column-wise usage.

Two CUDA kernels for quantization are included.
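
For illustration, below is a minimal PyTorch sketch of the 1x128 subchannel scheme described above: each block of 128 contiguous values along the last dimension shares one scale chosen so the block amax maps to the FP8 maximum (the 128x128 case is analogous, with one scale per 128x128 tile). This is a reference for the quantization math only, not the API added in this PR; the function names and the choice of E4M3 are assumptions.

import torch

def quantize_1x128_reference(x: torch.Tensor, block: int = 128):
    """Reference 1x128 blockwise FP8 quantization (illustrative only)."""
    fp8_max = torch.finfo(torch.float8_e4m3fn).max   # 448 for E4M3
    rows, cols = x.shape
    assert cols % block == 0, "a real kernel must handle ragged tails"
    blocks = x.view(rows, cols // block, block).float()
    amax = blocks.abs().amax(dim=-1, keepdim=True)
    scale = fp8_max / amax.clamp(min=1e-12)          # per-block quantization scale
    q = (blocks * scale).clamp(-fp8_max, fp8_max).to(torch.float8_e4m3fn)
    # Keep the dequantization scale (1/scale) per block, shaped (rows, cols // block),
    # so it can later be applied alongside a GEMM.
    return q.view(rows, cols), (1.0 / scale).squeeze(-1)

def dequantize_1x128_reference(q: torch.Tensor, dq_scale: torch.Tensor, block: int = 128):
    rows, cols = q.shape
    blocks = q.view(rows, cols // block, block).float()
    return (blocks * dq_scale.unsqueeze(-1)).view(rows, cols)

# Round trip: reconstruction error is bounded by FP8 rounding within each block.
x = torch.randn(256, 512)
q, s = quantize_1x128_reference(x)
assert (dequantize_1x128_reference(q, s) - x).abs().max() < 0.1 * x.abs().max()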

Type of change

  • Documentation change (change only to the documentation, either a fix or new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

Please list the changes introduced in this PR:

  • PyTorch/C++ Quantizer class
  • PyTorch/C++ Quantized Tensor class
  • Quantization CUDA kernels for 1x128 and 128x128 block sizes
  • C++ testing of nvte_quantize API
  • Python testing of quantization via tex.quantize
  • Basic Quantizer
    • 2D with tests
    • 1D with tests
    • C++ bitwise tests
    • Generalized shape coverage
  • Python bitwise tests for Quantizer
  • Columnwise test coverage (see the conceptual sketch after this list)
    • Remove row-wise usage and check dequantize
  • Create Tensor in C++ test coverage
    • 1D
    • 2D
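
As a conceptual illustration of the column-wise coverage item above (not the PR's actual test code), a column-wise copy can be modeled with the reference sketch from the Description by quantizing the transpose, so blocks run along the other dimension; dequantizing either copy should recover the input up to FP8 rounding. All names below are hypothetical.

import torch

# Reuses quantize_1x128_reference / dequantize_1x128_reference from the sketch above.
x = torch.randn(256, 512)

# Row-wise usage: 1x128 blocks along the last dimension.
q_row, s_row = quantize_1x128_reference(x)

# Column-wise usage modeled as quantizing the transpose, so blocks run down the columns.
q_col, s_col = quantize_1x128_reference(x.t().contiguous())

x_from_row = dequantize_1x128_reference(q_row, s_row)
x_from_col = dequantize_1x128_reference(q_col, s_col).t()

# Different block partitions quantize differently, so the two round trips agree
# only up to FP8 rounding, not bitwise.
torch.testing.assert_close(x_from_row, x_from_col, atol=0.05, rtol=0.2)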

Checklist that can arguably be deferred to a future MR:

  • PyTorch API surface
    • get/set data
    • Operations other than quant/dequant
    • View/Reshape
  • Fused DBIAS/Activation
  • Dequantize in C++

Tasks that depend on a GEMM and are therefore not included:

  • GEMM implementation in general_gemm
  • Recipe Setup
  • Layer-wise numerical testing
  • Distributed numerical testing

Test Instructions

Python tests:

pytest tests/pytorch/test_float8blockwisetensor.py
pytest tests/pytorch/test_float8_blockwise_scaling_exact.py

C++ tests:

TE_PATH=<where_is_TE>/ bash qa/L0_cppunittest/test.sh
# Wait for the build to complete.
# To run specific tests
./tests/cpp/build/operator/test_operator --gtest_filter='*FusedCastFloat8*wiseTestSuite*'

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

kwyss-nvidia and others added 2 commits February 26, 2025 15:05
The classes are configurable for 128x128 block size
and 1x128 block size via setting block_scaling_dim == 2 or 1 respectively.

Scale tensors are stored in a format amenable to matrix multiplication;
however, matmul integration is deferred to a separate story.

Fusions of quantization with DBIAS or activation functions are not yet
implemented, and dequantization is currently implemented in torch.

Tests for quantization are included at the C++ and PyTorch layers, with
exact comparison against reference quantizer behavior as well as coverage
of interesting API branches such as tensor creation in PyTorch and C++
and dequantization of row-wise and column-wise usage.

Two CUDA kernels for quantization are included; they are direct ports
of equivalents in the kitchen repository, where a subchannel recipe
has been used for end-to-end training.
@zhongbozhu

Great to see this PR!

Can you add a description of how to run your unit tests? Thank you.

@@ -96,7 +96,7 @@ def prepare_for_saving(self) -> Tuple[list[Optional[torch.Tensor]], MXFP8TensorB
         """Prepare the tensor base for saving for backward

         After calling this, the tensor instance does not hold any
-        data.
+        data. Yes it does? TODO
Member

It is being fixed in #1500.

Author

Thanks. I'll track that as an example of a good pattern to follow for Float8BlockwiseQTensorBase.
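
For context on the pattern being referenced, here is a rough sketch of how a prepare_for_saving/restore_from_saved pair typically hands buffers to autograd and drops the instance's own references (hypothetical class and attribute names; this is not the Transformer Engine implementation, and the docstring fix itself lands in #1500):

from typing import Optional, Tuple
import torch

class BlockwiseQTensorBaseSketch:
    """Illustrative container for quantized data plus per-block scales."""

    def __init__(self, rowwise_data: torch.Tensor, rowwise_scale: torch.Tensor):
        self._rowwise_data: Optional[torch.Tensor] = rowwise_data
        self._rowwise_scale: Optional[torch.Tensor] = rowwise_scale

    def prepare_for_saving(self) -> Tuple[list, "BlockwiseQTensorBaseSketch"]:
        # Hand the raw buffers to autograd's save_for_backward and clear the
        # instance's references so no tensor is held in two places.
        tensors = [self._rowwise_data, self._rowwise_scale]
        self._rowwise_data = None
        self._rowwise_scale = None
        return tensors, self

    def restore_from_saved(self, tensors: list) -> list:
        # Re-attach the buffers returned by prepare_for_saving and hand back
        # the unconsumed remainder of the saved-tensor list.
        self._rowwise_data, self._rowwise_scale = tensors[0], tensors[1]
        return tensors[2:]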
